SYDE 556/750: Simulating Neurobiological Systems

Terry Stewart

Online Learning

  • What do we mean by learning?
    • When we use an integrator to keep track of location, is that learning?
    • Probably not
    • What about the learning used to complete a pattern in the Raven's Progressive Matrices task?
    • Less clear
  • We'll stick with a simple definition of learning
    • Changing connection weights between groups of neurons
  • Why might we want to change connection weights?
  • This is what traditional neural network approaches do
    • Change connection weights until it performs the desired task
    • Once it's doing the task, stop changing the weights
  • But we have a method for just solving for the optimal connection weights
    • So why bother learning?

Why learning might be useful

  • We might not know the function at the beginning of the task
    • Example: a creature explores its environment and learns that eating red objects is bad, but eating green objects is good
      • what are the inputs and outputs here?
  • The desired function might change
    • Example: an ensemble whose input is a desired hand position, but the output is the muscle tension (or joint angles) needed to get there
      • why would this change?
  • The optimal weights we solve for might not be optimal
    • How could they not be optimal?
    • What assumptions are we making?

The simplest approach

  • What's the easiest way to deal with this, given what we know?
  • If we need new decoders
    • Let's solve for them while the model's running
    • Gather data to build up our $\Gamma$ and $\Upsilon$ matrices
  • Example: eating red but not green objects
    • Decoder from state to $Q$ value (utility of action) for eating
    • State is some high-dimensional vector that includes the colour of what we're looking for
      • And probably some other things, like whether it's small enough to be eaten
    • Initially doesn't use colour to get output
    • But we might experience a few bad outcomes after red, and good after green
    • These become new $x$ samples, with corresponding $f(x)$ outputs
    • Gather a few, recompute the decoder (see the sketch after this list)
      • Could even do this after every timestep
  • Example: converting hand position to muscle commands
    • Send random signals to muscles
    • Observe hand position
    • Use that to train decoders
  • Example: going from optimal to even more optimal
    • As the model runs, we gather $x$ values
    • Recompute decoder for those $x$ values
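
A minimal sketch of this idea (all names are illustrative), assuming we can record the neuron activities for each sample: accumulate the least-squares statistics $\Gamma$ and $\Upsilon$ as samples arrive, and re-solve for the decoders whenever we like.

```python
import numpy as np

# Online decoder solving (names illustrative): keep running sums of the
# least-squares statistics and re-solve as new (x, f(x)) samples arrive.
n_neurons, dims = 100, 1
Gamma = np.zeros((n_neurons, n_neurons))   # running sum of a a^T
Upsilon = np.zeros((n_neurons, dims))      # running sum of a f(x)^T

def observe(a, fx):
    """Add one sample: a is the activity vector, fx the observed f(x)."""
    global Gamma, Upsilon
    Gamma += np.outer(a, a)
    Upsilon += np.outer(a, fx)

def solve_decoders(reg=0.1):
    # a small ridge term stands in for the usual noise regularization;
    # this could be called after every timestep, at some cost
    return np.linalg.solve(Gamma + reg * np.eye(n_neurons), Upsilon)
```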

What's wrong with this approach

  • Feels like cheating
  • Why?
  • Two kinds of problems:
    • Not biologically realistic
      • How are neurons supposed to do all this?
      • store data
      • solve decoders
      • timing
    • Computationally expensive
      • Even if we're not worried about realism
  • Note: these two problems may be related...

Traditional neural networks

  • What do they do?
  • Incremental learning
    • as you get examples, shift the connection weights slightly based on that example
    • don't have to consider all the data when making an update
  • Example: Perceptron learning (Rosenblatt, 1957); see the sketch after this list
    • $\Delta w_i = \kappa (y_d - y) x_i$

  • Problems with perceptron
    • can't do all possible functions
    • Just linear functions of $x$
    • Is that a problem?
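
A minimal sketch of the perceptron rule (all names illustrative), assuming a single threshold unit:

```python
import numpy as np

kappa = 0.01  # learning rate

def perceptron_step(w, x, y_d):
    y = 1.0 if np.dot(w, x) > 0 else 0.0  # threshold activation
    return w + kappa * (y_d - y) * x      # Delta w_i = kappa (y_d - y) x_i

# usage: learn OR over 2-bit inputs (third input is a constant bias)
w = np.zeros(3)
data = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 1)]
for _ in range(100):
    for x, y_d in data:
        w = perceptron_step(w, np.array(x, dtype=float), y_d)
```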

Backprop and the NEF

  • How is this problem normally solved?
    • Multiple layers

  • But now a new rule is needed
    • Standard answer: backprop
    • Same as perceptron for first layer
    • Estimate correct "hidden layer" input, and repeat
  • What would this be in NEF terms?
  • Remember that we're already fine with linear decoding
    • encoders (and $\alpha$ and $J^{bias}$) are the first layer of weights, decoders are the second (see the sketch after this list)
    • Note that in the NEF, we combine many of these together
  • We can just use the standard perceptron rule
    • as long as there are lots of neurons, and we've initialized them well with the desired intercepts and maximum rates, we should be able to decode
    • but, what might backprop do?
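
A quick numerical check of that factorization (all names illustrative): connecting through the full weight matrix $\omega_{ij} = \alpha_j d_i \cdot e_j$ delivers the same input current as decoding with $d$ and then re-encoding with $\alpha$ and $e$ (bias currents omitted).

```python
import numpy as np

rng = np.random.default_rng(0)
n_pre, n_post, D = 100, 80, 2
d = rng.normal(size=(n_pre, D)) * 0.01       # decoders of the pre ensemble
e = rng.normal(size=(n_post, D))             # encoders of the post ensemble
e /= np.linalg.norm(e, axis=1, keepdims=True)
alpha = rng.uniform(100, 200, n_post)        # gains

W = (alpha[:, None] * e) @ d.T               # full weights, shape (n_post, n_pre)

a = rng.uniform(0, 100, n_pre)               # presynaptic activities
J_factored = alpha * (e @ (d.T @ a))         # decode, then encode
J_full = W @ a                               # one full weight matrix
assert np.allclose(J_factored, J_full)
```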

Biologically realistic perceptron learning

  • Simple learning rule: $\Delta d_i = \kappa (y_d - y)a_i$
  • How do we make it realistic?
  • Decoders don't exist in the brain
    • Need weights
  • $\omega_{ij} = \alpha_j d_i \cdot e_j$
  • $\Delta \omega_{ij} = \alpha_j \kappa (y_d - y)a_i \cdot e_j$
  • let's write $(y_d - y)$ as $E$
  • $\Delta \omega_{ij} = \alpha_j \kappa a_i E \cdot e_j$
  • $\Delta \omega_{ij} = \kappa a_i (\alpha_j E \cdot e_j)$
    • What's $\alpha_j E \cdot e_j$?
    • That's the current that this neuron would get if it had $E$ as an input
    • but we don't want this current to drive the neuron
    • rather, we want it to change the weight
    • a modulatory input
  • This is the "Prescribed Error Sensitivity" PES rule (MacNeil & Eliasmith, 2011)

    • Any model in the NEF could use this instead of computing decoders
    • Requires some other neural group computing the error $E$
    • Used in Spaun for Q-value learning (reinforcement task)
    • Can even be used to learn circular convolution
      • only demonstrated up to 3 dimensions
      • why not more?
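
A minimal rate-based sketch of that final form (names illustrative); the error $E$ is assumed to come from some other neural group:

```python
import numpy as np

# PES sketch: Delta w_ij = kappa * a_i * (alpha_j E . e_j)
def pes_step(w, a, E, alpha, encoders, kappa=1e-4):
    # the modulatory current each postsynaptic neuron would get from E
    mod = alpha * (encoders @ E)           # shape (n_post,)
    # scale by presynaptic activity to get the weight update
    return w + kappa * np.outer(a, mod)    # w has shape (n_pre, n_post)
```
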
  • Nengo Examples:

    • learn_communicate.py
    • learn_square.py
    • learn_product.py
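
For reference, a sketch of the communication-channel example in the modern Nengo API (the filenames above come from the older scripting interface); the ensemble sizes and learning rate are illustrative:

```python
import numpy as np
import nengo

model = nengo.Network()
with model:
    stim = nengo.Node(lambda t: np.sin(2 * np.pi * t))
    pre = nengo.Ensemble(100, dimensions=1)
    post = nengo.Ensemble(100, dimensions=1)
    error = nengo.Ensemble(100, dimensions=1)  # neural group computing E

    nengo.Connection(stim, pre)
    # start from a connection that computes nothing useful
    conn = nengo.Connection(pre, post, function=lambda x: [0.0],
                            learning_rule_type=nengo.PES(learning_rate=1e-4))
    # error = actual - desired, fed into the learning rule
    nengo.Connection(post, error)
    nengo.Connection(stim, error, transform=-1)
    nengo.Connection(error, conn.learning_rule)

with nengo.Simulator(model) as sim:
    sim.run(10.0)
```
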
  • Is this realistic?

    • uses only locally available information
    • Does anything like this happen in the brain?
    • Dopamine seems to act as a gain on weight changes (maybe)
    • But are weight changes proportional to pre-synaptic activity?
      • Sort of

More complex learning

  • Hebbian learning

    • completely unsupervised
    • Neurons that fire together, wire together
    • $\Delta \omega_{ij} = \kappa a_i a_j$
    • on its own, that rule would be unstable
      • Why?
  • BCM rule (Bienenstock, Cooper, & Munro, 1982)

    • $\Delta \omega_{ij} = \kappa a_i a_j (a_j-\theta)$
    • $\theta$ is an activity threshold
      • if post-synaptic neuron is more active than this threshold, increase strength
      • otherwise decrease it
    • Other than that, it's a standard Hebbian rule
    • Where would we get $\theta$?
      • need to store something about the overall recent activity of neuron $j$ so it can be compared to its current activity
      • Just have $\theta$ be a PSTC-filtered (low-pass) version of $a_j$'s spiking (see the sketch after this list)
    • Result: only a few neurons will fire
      • sparsification
    • What would this do in NEF terms?
      • Still represent $x$, but with very sparse encoders
    • This is still a rule on the weight matrix, but functionally seems to be more about encoders than decoders
      • What could we do, given that?
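
A minimal rate-based sketch of BCM with a running-average threshold (all names illustrative):

```python
import numpy as np

kappa, tau, dt = 1e-6, 0.1, 0.001

def bcm_step(w, a_pre, a_post, theta):
    # theta tracks recent postsynaptic activity: a low-pass filter,
    # standing in for the PSTC-filtered spiking described above
    theta = theta + (dt / tau) * (a_post - theta)
    # Hebbian term gated by how far a_j is above or below its threshold
    dw = kappa * np.outer(a_pre, a_post * (a_post - theta))
    return w + dw, theta
```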

The homeostatic Prescribed Error Sensitivity (hPES) rule

  • Works as well as (or better than) PES
    • Seems to be a bit more stable, but analysis is ongoing
  • Biological evidence?
    • Spike-Timing Dependent Plasticity (STDP)

  • Still work to do for comparison, but seems promising
  • An error-driven term for improving decoders
  • A Hebbian sparsification term to improve encoders (a combined sketch follows below)
    • or perhaps to sparsify connections (energy savings in the brain, but not necessarily in simulation)
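
A heavily hedged sketch of how the two terms might be combined, in the spirit of hPES; the mixing parameter $S$ and all names here are assumptions for illustration, not the published form of the rule:

```python
import numpy as np

# Hypothetical hPES-style update (assumed form): a weighted blend of the
# PES error-driven term and a BCM-like homeostatic term.
def hpes_step(w, a_pre, a_post, E, alpha, encoders, theta, kappa=1e-4, S=0.8):
    error_term = encoders @ E                # PES: error along each encoder
    hebb_term = a_post * (a_post - theta)    # BCM: homeostatic sparsification
    dw = kappa * np.outer(a_pre, alpha * (S * error_term + (1 - S) * hebb_term))
    return w + dw
```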